Implementing approximate regularities

Authors

  • Manolis Christodoulakis
  • Costas S. Iliopoulos
  • Kunsoo Park
  • Jeong Seop Sim
Abstract

Keywords: Approximate regularities, Approximate period, Approximate cover, Approximate seed, Hamming distance, Edit distance, Weighted edit distance, Smallest distance approximation, Restricted smallest distance approximation.

1. INTRODUCTION

Finding regularities in strings is useful in a wide range of applications that involve string manipulation, such as molecular biology, data compression, and computer-assisted music analysis. Typical regularities are repetitions, periods, covers, and seeds. In applications such as molecular biology and computer-assisted music analysis, finding exact repetitions is not always sufficient. A more appropriate notion is that of approximate repetitions [1-3], where errors are allowed. In this paper, we consider three different kinds of approximation: the Hamming distance, the edit distance, and the weighted edit distance.

Sim, Iliopoulos, Park and Smyth gave polynomial time algorithms for finding approximate periods [4], and Sim, Park, Kim and Lee gave polynomial time algorithms for the approximate covers problem [5]. More recently, Christodoulakis, Iliopoulos, Park and Sim gave polynomial time algorithms for the approximate seeds problem [6]. In this paper, we implement and compare the algorithms given in [4-6].

*This work was supported by MOST Grant M1-0309-06-0003.
0895-7177/05/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.mcm.2005.09.013

Table 1. Comparison between periods, covers, and seeds.

          What Is Covered                         How It Is Covered
Periods   Right extension of the text             Concatenations
Covers    The text itself                         Concatenations and overlaps
Seeds     Left and right extension of the text    Concatenations and overlaps

2. PRELIMINARIES

2.1. Distance Functions

We call the distance $\delta(x, y)$ between two strings $x$ and $y$ the minimum cost to transform $x$ into $y$. The special symbol $\Delta$ denotes the absence of a character (i.e., an insertion or a deletion occurs).

The edit or Levenshtein distance between two strings is the minimum number of edit operations that transform one string into the other. The edit operations are insertion, deletion, and substitution, each of cost 1.

The Hamming distance between two strings is the minimum number of substitutions that transform one string into the other. Note that the Hamming distance can be defined only when the two strings have the same length, because it does not allow insertions and deletions.

We also consider a generalized version of the edit distance model, the weighted edit distance, where each insertion, deletion, and substitution has a different cost, stored in a penalty matrix.

2.2. Approximate Regularities

Here, we give the definitions of approximate periods, covers, and seeds. These definitions are expressed differently from the corresponding original definitions given in [4-6]. This is done in order to expose their similarities and to provide the "background" common to all three of them.

Let $x$ and $s$ be strings over $\Sigma^*$, $\delta$ be a distance function, $t$ be an integer, and $s_1, s_2, \ldots, s_r$ ($s_i \neq \varepsilon$) be strings such that $\delta(s, s_i) \le t$, for $1 \le i \le r$.

DEFINITION 1. $s$ is a $t$-approximate period of $x$ if and only if there exists a superstring $y = xv$ (right extension) of $x$ that can be constructed by concatenating copies of the strings $s_1, s_2, \ldots, s_r$.

DEFINITION 2. $s$ is a $t$-approximate cover of $x$ if and only if $x$ can be constructed by overlapping or concatenating copies of the strings $s_1, s_2, \ldots, s_r$.

DEFINITION 3. $s$ is a $t$-approximate seed of $x$ if and only if there exists a superstring $y = uxv$ (right and left extensions) of $x$ that can be constructed by overlapping or concatenating copies of the strings $s_1, s_2, \ldots, s_r$.

3. PROBLEM DEFINITIONS AND SOLUTIONS

3.1. Smallest Distance Approximate Period/Cover/Seed Problem

DEFINITION 4. Let $x$ be a string of length $n$, $s$ be a string of length $m$, and $\delta$ be a distance function. The smallest distance approximate period/cover/seed problem is to find the minimum integer $t$ such that $s$ is a $t$-approximate period/cover/seed of $x$.

Solving this problem involves two steps.

1. Compute the distance between $s$ and every substring of $x$. Let $w_{ij}$ be the distance between $s$ and $x[i..j]$, for $1 \le i \le j \le n$, that is, $w_{ij} = \delta(x[i..j], s)$. Section 3.1.1 explains in detail how these $w_{ij}$ are computed.

2. Compute the minimum $t$ such that $s$ is a $t$-approximate period/cover/seed of $x$. Let $t_i$ be the minimum value such that $s$ is a $t_i$-approximate period/cover/seed of $x[1..i]$. Initially, $t_0 = 0$. For $i = 1$ to $n$, where $n$ is the length of the text $x$, we compute

$$t_i = \min_{h_{\min} \le h \le h_{\max}} \left\{ \max \left\{ \min_{h \le j \le j_{\max}} \{ t_j \},\; w_{h+1,i} \right\} \right\}.$$

The value $t_n$ is the minimum $t$ such that $s$ is a $t$-approximate period/cover/seed of $x$. The values of $h_{\min}$, $h_{\max}$, and $j_{\max}$ depend on the regularity (period, cover, or seed) we are computing and on the distance function we are using, and will be explained in Section 3.1.2.

3.1.1. Step 1: The distance between s and every substring of x

This step resolves the matter of what is covered, as described in the definitions of approximate periods, covers, and seeds.

The method we use to compute the distance between $s$ and $x[i..j]$, for $1 \le i \le j \le n$, depends on the distance function we are using. For the Hamming distance, it is a character-by-character comparison.
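As a minimal sketch of Step 1 under the Hamming distance (an illustration, not the authors' implementation), the Python snippet below compares s character by character with every window of x. The function names `hamming` and `hamming_window_distances` are hypothetical, and the sketch assumes that only substrings of length m = |s| are considered, since the Hamming distance is undefined for strings of different lengths.

```python
from typing import Dict, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def hamming_window_distances(x: str, s: str) -> Dict[Tuple[int, int], int]:
    """Map (i, j) -> w_ij for every window x[i..j] of length len(s).

    Positions are 1-based and inclusive, as in the text.
    """
    m = len(s)
    return {
        (start + 1, start + m): hamming(x[start:start + m], s)
        for start in range(len(x) - m + 1)
    }


if __name__ == "__main__":
    # Each entry is the number of mismatches between "abc" and one window of x.
    print(hamming_window_distances("abcabd", "abc"))
```

Each window costs O(m) comparisons, so the whole table takes O(nm) time in this naive form.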
For the edit and weighted edit distance, we use dynamic programming. To compute the distance between two strings, $x$ and $y$, a dynamic programming table, called the D-table, of size $(|x|+1) \times (|y|+1)$, is used. Each entry $D[i,j]$, $0 \le i \le |x|$ and $0 \le j \le |y|$, stores the minimum cost of transforming $x[1..i]$ to $y[1..j]$. Initially, $D[0,0] = 0$, $D[i,0] = D[i-1,0] + \delta(x[i], \Delta)$, and $D[0,j] = D[0,j-1] + \delta(\Delta, y[j])$. Then, we can compute all the entries of the D-table in $O(|x||y|)$ time by the following recurrence,

$$D[i,j] = \min \begin{cases} D[i-1,j] + \delta(x[i], \Delta), \\ D[i,j-1] + \delta(\Delta, y[j]), \\ D[i-1,j-1] + \delta(x[i], y[j]), \end{cases}$$

where $\delta(a, b)$ is the cost of substituting character $a$ with character $b$, $\delta(a, \Delta)$ is the cost of deleting $a$, and $\delta(\Delta, a)$ is the cost of inserting $a$.

Covering a left extension of the text $x$ can be seen as aligning a suffix of the pattern $s$, $s[k..m]$, with a prefix of the text, $x[1..j]$, as shown in Figure 1. Therefore, in this case, we need to redefine $w_{1j}$, instead of being the distance between $x[1..j]$ and $s$, to be the minimum distance between $x[1..j]$ and any suffix of the pattern, $s[k..m]$. Formally,

$$w_{1j} = \min_{1 \le k \le m} \{ \delta(x[1..j], s[k..m]) \}. \tag{A}$$

Similarly, covering a right extension of the text $x$ can be seen as aligning a prefix of the pattern $s$, $s[1..k]$, with a suffix of the text, $x[i..n]$, as shown in Figure 1. Therefore, in this case, we redefine $w_{in}$, instead of being the distance between $x[i..n]$ and $s$, to be the minimum distance between $x[i..n]$ and any prefix of the pattern, $s[1..k]$. Formally,

$$w_{in} = \min_{1 \le k \le m} \{ \delta(x[i..n], s[1..k]) \}. \tag{B}$$
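The D-table recurrence and the prefix and suffix variants (A) and (B) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code evaluated in the paper: the function names and the cost callbacks `sub`, `ins`, and `dele` are hypothetical stand-ins for the penalty matrix, with unit costs as the default. It relies on two observations: the last row of the D-table for a text fragment against s already holds the distance to every prefix of s, and reversing both strings turns the suffix case (A) into the prefix case (B).

```python
from typing import Callable, List

Sub = Callable[[str, str], int]
Gap = Callable[[str], int]


def d_table(x: str, y: str,
            sub: Sub = lambda a, b: 0 if a == b else 1,
            ins: Gap = lambda b: 1,
            dele: Gap = lambda a: 1) -> List[List[int]]:
    """D[i][j] = minimum cost of transforming x[1..i] into y[1..j]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):              # first column: delete x[1..i]
        D[i][0] = D[i - 1][0] + dele(x[i - 1])
    for j in range(1, m + 1):              # first row: insert y[1..j]
        D[0][j] = D[0][j - 1] + ins(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + dele(x[i - 1]),                # delete x[i]
                D[i][j - 1] + ins(y[j - 1]),                 # insert y[j]
                D[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),   # substitute
            )
    return D


def edit_distance(x: str, y: str, **costs) -> int:
    """(Weighted) edit distance: the bottom-right entry of the D-table."""
    return d_table(x, y, **costs)[len(x)][len(y)]


def min_distance_to_prefix(text_suffix: str, s: str, **costs) -> int:
    """Formula (B): min over 1 <= k <= m of delta(text_suffix, s[1..k])."""
    last_row = d_table(text_suffix, s, **costs)[len(text_suffix)]
    return min(last_row[1:])               # skip k = 0 (the empty prefix)


def min_distance_to_suffix(text_prefix: str, s: str, **costs) -> int:
    """Formula (A), obtained from (B) by reversing both strings."""
    return min_distance_to_prefix(text_prefix[::-1], s[::-1], **costs)
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3 with unit costs. Passing `sub=lambda a, b: 0 if a == b else 3` models a weighted setting in which a substitution costs more than a deletion followed by an insertion.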


Similar resources

Approximate Geometric Regularities

Current reverse engineering systems are able to generate simple valid boundary representation (B-rep) models from 3D range data. Such models suffer from various inaccuracies caused by noise in the input data and algorithms. Reverse engineered geometric models may be beautified by finding approximate geometric regularities in such a model, and imposing a suitable subset of them on the model by u...


Approximate Periods of Strings for Absolute Distances

Approximate periods of strings can be used to find approximate repetitive regularities in strings. In this paper we consider the approximate period problem for different absolute distances.


Finding Approximate Shape Regularities for Reverse Engineering

Current reverse engineering systems can generate boundary representation (B-rep) models from 3D range data. Such models suffer from inaccuracies caused by noise in the input data and algorithms. The quality of reverse engineered geometric models can be improved by finding candidate shape regularities in such a model, and constraining the model to meet a suitable subset of them, in a post-proces...


The Generalized Approximate Regularities in Strings

We concentrate on the generalized string regularities and study the minimum approximate λ-cover problem and the minimum approximate λ-seed problem of a string. Given a string x of length n and an integer λ, the minimum approximate λ-cover (resp. seed) problem is to find a set of λ substrings each of equal length that covers x (resp. a superstring of x) with the minimum error, under a variety of...


Partial Approximate Symmetry Detection of Geometric Model

Engineering geometric models are often designed to have symmetries and other regularities. In knowledge based reuse, creative design and design for mass customization, to have the information of such symmetries and other regularities from a geometric model is very useful. And this can make us understand more about the geometric model. In reverse engineering, B-rep models are created by fitting ...


Classification-based Approximate Policy Iteration: Experiments and Extended Discussions

Tackling large approximate dynamic programming or reinforcement learning problems requires methods that can exploit regularities, or intrinsic structure, of the problem in hand. Most current methods are geared towards exploiting the regularities of either the value function or the policy. We introduce a general classification-based approximate policy iteration (CAPI) framework, which encompasse...



Journal:
  • Mathematical and Computer Modelling

Volume 42

Publication date: 2005